Deprecate breadth_first() and always spawn FIFO #601
Conversation
I spent some time testing this today. Full results are here: https://gist.github.com/bholley/88b0ff875061ecf7c74cef6b371d89f7

Summary of elements_matched (format: vanilla, vanilla+no-breadth-first, cuviper/fifo-spawn): […]

From the 'Analysis' section: […]
I'll provide a bit more context here about what Stylo is doing. When we style the DOM, we traverse it top-down, here: https://searchfox.org/mozilla-central/source/servo/components/style/driver.rs#62

We start doing this sequentially (breadth-first). For every node in the DOM tree, we process it and then append its children to the work queue. When the queue gets big enough, we kick over to the rayon pool, where work stealing happens on units of 16 DOM nodes. Each thread keeps a thread-local cache of previous styles it has seen that it might be able to reuse. Sharing only works between elements at the same level in the tree, which is why the breadth-first traversal is important.

Moreover, since the cache is thread-local, and since sharing is more likely to be viable for nodes that are closer together, we generally want the threads to focus on distinct areas of the overall tree, rather than jumping around, to improve locality. So my interpretation of the results here is that the new algorithm is disrupting that locality, causing us to share less when there are more threads.

The loss of sharing here is significant enough that it would have a measurable impact on memory usage and performance in Firefox, which is something we want to avoid. Ideally this would not involve us forking rayon, so I'm hoping we can tweak the new setup a bit to preserve the old characteristics.
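A minimal sketch of that traversal shape, for readers unfamiliar with Stylo. The `Node` type, the `PARALLEL_CUTOFF` threshold, and the helper names are hypothetical stand-ins; the real driver is at the searchfox link above:

```rust
use std::collections::VecDeque;

// Hypothetical DOM node handle; the real traversal lives in driver.rs.
#[derive(Clone)]
struct Node;

impl Node {
    fn style(&self) { /* compute style, possibly reusing a cached one */ }
    fn children(&self) -> Vec<Node> { Vec::new() }
}

const WORK_UNIT: usize = 16;       // work stealing happens on units of 16 nodes
const PARALLEL_CUTOFF: usize = 64; // hypothetical "queue big enough" threshold

fn traverse(root: Node) {
    let mut queue: VecDeque<Node> = VecDeque::new();
    queue.push_back(root);

    // Sequential breadth-first phase: process a node, append its children.
    while let Some(node) = queue.pop_front() {
        node.style();
        queue.extend(node.children());

        // Once the frontier is wide enough, kick over to the rayon pool.
        if queue.len() >= PARALLEL_CUTOFF {
            let frontier: Vec<Node> = queue.drain(..).collect();
            rayon::scope(|s| spawn_units(s, frontier));
            return;
        }
    }
}

// Spawn the frontier into the scope in chunks of WORK_UNIT nodes; each
// unit styles its nodes and spawns its children as new units in the
// same scope, so everything proceeds level by level.
fn spawn_units<'a>(s: &rayon::Scope<'a>, nodes: Vec<Node>) {
    for unit in nodes.chunks(WORK_UNIT) {
        let unit = unit.to_vec();
        s.spawn(move |s| {
            let mut next = Vec::new();
            for node in &unit {
                node.style();
                next.extend(node.children());
            }
            if !next.is_empty() {
                spawn_units(s, next);
            }
        });
    }
}
```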
Thank you for testing, and for the additional context! I think your thread-local cache is the effect I didn't understand before. My change gives spawns a global FIFO priority, but that doesn't actually help your localized cache much. I'll have to think more about your scenario and how we might better support it.
No worries, we're not going to change […]
Great, thank you! Feel free to ping me if you have any questions about the peculiarities in our setup, or need any additional testing done.
I'm afraid the queue-per-scope approach is not going to fly. We're losing performance in three ways: […]

I suggest we do the following instead: […]

Note that only […]. This way the three problems are fixed: […]

The local FIFO queues should probably be […]
@stjepang I think you're over-simplifying it, but of course I could be over-complicating it too... :)
If we want spawns to have some global order (which is not necessarily agreed upon), then I think both of these points are necessary trade-offs.
While true, I think this is a micro-optimization in the grand scheme of things. Note that @bholley's analysis was primarily looking at a high-level metric, indicating the effectiveness of their cache. We should worry about these kinds of effects first, because I expect the real workload to far outweigh these individual queue operations.

The rest of your idea of per-thread LIFO and FIFO queues is essentially what we already discussed earlier in #590. My main concern about that was nested/mixed mode, noted here: #590 (comment). To restate that, suppose you have a […]

If that inner scope only has the one shared FIFO to pull from, then it will keep pulling big outer items before it ever gets to the smaller inner items. I fear you'll end up with something like […]

(AFAIK, Servo/Firefox folks are not nesting scopes or using […])
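For concreteness, the nested/mixed-mode shape being described looks something like this, using rayon's real `scope` API (the workloads are placeholders):

```rust
// The concern: if a worker had only a single shared FIFO, the inner
// scope (which must finish before `outer_task` returns) would keep
// dequeuing big outer items before ever reaching its own small ones.
fn outer_task(i: usize) {
    rayon::scope(|inner| {
        for j in 0..4 {
            // Small, short-lived items that the inner scope must drain
            // before it can return.
            inner.spawn(move |_| {
                println!("inner item {}.{}", i, j);
            });
        }
    });
    // The inner scope has fully completed by this point.
}

fn main() {
    rayon::scope(|outer| {
        for i in 0..8 {
            // Big, coarse-grained items in the outer scope.
            outer.spawn(move |_| outer_task(i));
        }
    });
}
```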
The queue-per-scope strategy is ineffective in Firefox because, as @bholley explains, it makes threads jump around rather than focus on distinct areas of the overall tree. For the cache to work, a thread that spawns a task must, with high probability, be the one that later picks it up. Another suggestion would be to have `num_workers` queues in each scope, which can be accessed by thread indices. But this is more complicated, and scope creation becomes more costly...
Honestly, I wouldn't worry about it. Yes, the model where nested scopes are in LIFO order but tasks within scopes are in FIFO order is really wonderful. But... it's difficult to implement efficiently, nested scopes are not common in practice (don't know if they exist at all!), and the worst thing that can happen if we share queues among scopes is that some stack-allocated variables don't get dropped as soon as we'd ideally like. It's just not worth it, I think. :)
Yeah, I think I get it now -- what I'm searching for is how to generalize this. Firefox wants it this way because they have this thread-local cache. I suppose in general you could argue that CPU caches probably benefit in a similar way, although that would favor LIFO even more. (Keep working on the thing you just touched.) The whole breadth-first thing puts us in a weird place where we care about FIFO order, but only relative to the thread, and I'm not sure this is something we can present well as a general solution.
Yeah, I was also thinking along those lines. That should be better at maintaining the current semantics, but I still feel like these are weird semantics in the first place. I get why it works for what Firefox is doing, but it's a lot of implicit behavior to depend on.
If you get a parallel-iterator/join in there too, then we're back to #590. It's also possible that these things would not be explicitly nested, but end up like that anyway because of work stealing. I'm trying to figure out something that will be robust even in such weird cases.
Ok, here's another idea that takes the best of both worlds. :)

Start with the design proposed in my previous comment. Each worker thread has a FIFO queue and a LIFO queue. Scopes don't have their own queues. Now we just need a tweak that prioritizes inner scopes over outer scopes. The tweak is: when a worker thread creates a new scope, it temporarily swaps out its two queues for new ones. Once the scope completes, the old queues are put back.

So how do we implement this? Recall that we currently have a global […] We extend […]

That's pretty much it. Now, the thread that creates a scope will clearly prioritize its tasks over tasks from other scopes. But what about other threads - won't they mix up tasks belonging to different scopes in their own queues? Fortunately, they won't! Note that a thread will steal tasks only if both of its own queues are empty. Therefore, tasks from different scopes never end up in the same queue.

We could be a bit smarter about that […]
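A minimal sketch of that swap-on-scope-entry tweak. Everything here is a hypothetical stand-in for rayon's internals: `Job`, `WorkerLocal`, and the plain `Vec`/`VecDeque` storage replace the actual concurrent deques, just to show the mechanics:

```rust
use std::collections::VecDeque;

// Hypothetical job type; rayon's real jobs are more involved.
type Job = Box<dyn FnOnce() + Send>;

struct WorkerLocal {
    lifo: Vec<Job>,      // join/iterator work: push and pop at the back
    fifo: VecDeque<Job>, // spawned work: push at the back, pop at the front
}

impl WorkerLocal {
    // The tweak: entering a scope swaps in fresh queues, so tasks from
    // the new (inner) scope never mix with the enclosing scope's tasks
    // in this worker's queues.
    fn with_scope<R>(&mut self, f: impl FnOnce(&mut Self) -> R) -> R {
        let saved_lifo = std::mem::take(&mut self.lifo);
        let saved_fifo = std::mem::take(&mut self.fifo);
        let result = f(self);
        // Scope completion implies both fresh queues have drained.
        debug_assert!(self.lifo.is_empty() && self.fifo.is_empty());
        self.lifo = saved_lifo;
        self.fifo = saved_fifo;
        result
    }

    fn next_job(&mut self) -> Option<Job> {
        // Prefer the LIFO side, then FIFO spawns. Stealing from other
        // workers (not shown) happens only when both are empty, which
        // is why different scopes never mix in one worker's queues.
        self.lifo.pop().or_else(|| self.fifo.pop_front())
    }
}
```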
You want all stealing operations to be synchronized through […]
Anyway @stjepang, I'd be happy to see you implement your ideas in a competing PR for comparison. |
Along these lines, I'm trying to think of ways that Firefox could express this. One idea is that threads could explicitly create a new scope when they steal work. That could be done by detecting a changed […]

The point is to make all of this easier to explain and think about how jobs will execute. We can then say that […]
To be clear, we don't want to prevent threads from jumping to other areas of the tree if they run out of work to do. We just want to minimize far jumps when there's still nearby work.
Certainly, further work-stealing is still possible.
We discussed this issue in our meeting today: I'm going to prototype the idea of using N queues (one per thread) in the scope, which should preserve the behavior Firefox wants right now. I think we should be able to get the same cache-hit performance, and then we can see how this affects raw performance numbers too. Then I want to try to present these choices as explicit variants of `spawn` and `scope`.
Experiments from others (@stjepang) are still welcome too!
So @cuviper and I discussed some on gitter. I think we decided that — first — it makes sense to tinker with a few more variants and try to measure the performance, just to validate our hypotheses.

In particular, I am interested in the variant where scopes have N FIFOs, one per worker thread. This preserves the intermixed FIFO/LIFO layout, at the cost of 2 pushes/pops, but also preserves thread locality (that is, threads prefer to run work they spawned). What I like about this FIFO-per-thread variant is that if you do the "very simple" thing of making one scope and spawning into it, you basically get the "classic work stealing" behavior, which time has shown to be a pretty decent default.

I don't love the "indirection overhead" of requiring two queue operations, of course. I also worry that if one were using nested scopes, the overhead might be high, as you might have quite a lot of deques being created.

More generally, it occurred to us that we could present this as variations on `spawn` and `scope`.
The other interesting question is whether @stjepang is correct and we should not worry so much about intermingled LIFO/FIFO behavior. I could certainly imagine though that one using […]
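A rough illustration of the N-FIFOs-per-scope variant described above. The names (`ScopeQueues`, `Job`) are hypothetical, and the `Mutex`-wrapped queues stand in for whatever concurrent structure a real implementation would use:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Hypothetical job type; rayon's real jobs are more involved.
type Job = Box<dyn FnOnce() + Send>;

struct ScopeQueues {
    // One FIFO per worker thread, indexed by worker index.
    per_thread: Vec<Mutex<VecDeque<Job>>>,
}

impl ScopeQueues {
    fn new(num_workers: usize) -> Self {
        ScopeQueues {
            per_thread: (0..num_workers)
                .map(|_| Mutex::new(VecDeque::new()))
                .collect(),
        }
    }

    // First push: the real job goes into the spawning worker's FIFO.
    // The caller then does a second push of an indirect "pop from this
    // scope queue" job onto its regular deque; hence 2 pushes/pops.
    fn push(&self, worker_index: usize, job: Job) {
        self.per_thread[worker_index].lock().unwrap().push_back(job);
    }

    // Executing the indirect job pops from the *front* of the FIFO, so
    // one thread's spawns run in FIFO order regardless of whether the
    // indirect job was popped locally (back) or stolen (front).
    fn pop(&self, worker_index: usize) -> Option<Job> {
        self.per_thread[worker_index].lock().unwrap().pop_front()
    }
}
```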
Argh, I did prototype those ideas shortly after we discussed them, but neglected to share it. That branch has implementations of three modes: […]
I applied the diff of the WIP commit to my local tree (which was the one I tested with fifo-spawn). IIUC, that's the only change. Assuming I didn't mess something up, sharing behavior matches stock Firefox, so this would work great for us. Thank you! Let me know when it's released, since I'll want to remeasure to be sure everything is working properly.
FYI, we're trying out a new RFC process for this in rayon-rs/rfcs#1.
I think we can close this in favor of #615 |
615: Implement RFC #1: FIFO spawns r=nikomatsakis a=cuviper

This implements rayon-rs/rfcs#1, adding `spawn_fifo`, `scope_fifo`, and `ScopeFifo`, and deprecating the `breadth_first` flag.

Fixes #590.
Closes #601.

Co-authored-by: Josh Stone <[email protected]>
The former default behavior was that spawns would be processed in FIFO order from the global injector queue, but would effectively use LIFO order when a thread popped from the back of its local deque. The `breadth_first()` flag made it so threads always popped from the front of their own deque, just like a stealer would, and this made spawns look breadth-first/FIFO in all cases. However, #590 showed that this had a bad effect on stack-oriented jobs like parallel iterators.

Now we always make unscoped `spawn` use the global injector queue, so they're always FIFO.

For scoped `spawn`, it's trickier, because we still want the thread which owns that scope to prioritize its own items so it can return. So we want to use the local deque, but still achieve FIFO of its `spawn` jobs. Now each scope gets its own private queue, and each `spawn` is pushed on this queue and pushes a reference to that queue (`ScopeQueue`) as an indirect job on the local deque.

When a `ScopeQueue` job is executed, it pops the real job from the front of its queue and executes that. No matter whether the `ScopeQueue` was popped from the back of a thread's deque or stolen from the front, the actual `spawn` will be processed in a consistent FIFO order.

Fixes #590.
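For reference, a minimal usage sketch of the API named above (`spawn_fifo`, `scope_fifo`, and `ScopeFifo` are the items #615 adds; the ordering comments restate the PR description):

```rust
// Unscoped spawn_fifo goes through the global injector queue; within a
// scope_fifo, spawns made on a given thread execute in the order they
// were spawned, whether the indirect job is popped locally or stolen.
fn main() {
    rayon::spawn_fifo(|| println!("unscoped FIFO spawn"));

    rayon::scope_fifo(|s| {
        for i in 0..4 {
            // The closure receives a &ScopeFifo, allowing further
            // FIFO spawns into the same scope.
            s.spawn_fifo(move |_| println!("scoped FIFO spawn {}", i));
        }
    });
}
```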